Population differentiated copy number variation between Eurasian wild boar and domesticated pig populations

Sus scrofa is a globally distributed livestock species that still maintains two different ways of life: wild and domesticated. Herein, we detected copy number variation (CNV) of 328 animals using short read alignment on Sscrofa11.1. We compared CNV among five groups of porcine populations: Asian domesticated (AD), European domesticated (ED), Asian wild (AW), European wild (EW), and Near Eastern wild (NEW). In total, 21,673 genes were identified on 154,872 copy number variation region (CNVR). Differences in gene copy numbers between populations were measured by considering the variance-based value \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${V}_{ST}$$\end{document}VST and the one-way ANOVA test followed by Scheffe test. As a result, 111 genes were suggested as copy number variable genes. Abnormally gained copy number on EEA1 in all populations was suggested the presence of minor CNV in the reference genome assembly, Sscrofa11.1. Copy number variable genes were related to meat quality, immune response, and reproduction traits. Hierarchical clustering of all individuals and mean pairwise \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${V}_{ST}$$\end{document}VST in breed level were visualized genetic relationship of 328 individuals and 56 populations separately. Our findings have shown how the complex history of pig evolution appears in genome-wide CNV of various populations with different regions and lifestyles.


Methods
Sample collection. The study population consisted of 328 individuals consist of 130 females and 198 males from 56 pig populations. The whole-genome sequencing (WGS) of wild boar and domesticated breeds were collected from Europe and Asia. 313 genomes were publicly available and sequenced using Illumina paired-end library and from SRA database (Table S1). 15 genomes including 5 Duroc, 5 Woori-Heukdon and 5 Korean Native were newly sequenced in this study. The 15 Blood samples were collected during routine veterinary treatments with the logistical support under the ethical approval of National Institute of Animal Science, Republic of Korea (NIAS20212224). All of the experimental protocols were approved by National Institute of Animal Science, Republic of Korea (NIAS20212224).
Whole genome sequencing. Fifteen genomes including 5 Duroc, 5 Woori-Heukdon and 5 Korean Native were newly sequenced in this study. Blood samples were collected for DNA extraction by Wizard® Genomic DNA Purification Kit (Promega) from National Institute of Animal Science, Rural National Institute of Animal Science, Republic of Korea. Library construction was performed for each individual using 2 μg of genomic DNA with Illumina TruSeq PCR-free (550) Kit. Sequencing was performed to generate 2 × 151 paired-end reads on the Illumina NovaSeq 6000 platform.
Whole genome sequence alignment. After quality control checking of raw reads using FastQC-0.11.8 16 , adapter and low-quality bases of reads were trimmed by Trimmomatic-0.39 17 . After checking the trimming results and quality of trimmed reads, the trimmed reads were mapped using BWA-0.7.17 MEM 18 to reference genome Sscrofa11.1 assembly 19 . The outputs of the sequence alignment map (SAM) were sorted, indexed, and compressed to binary format (BAM) by Samtools-1.9 20 . The duplicates in BAM files were marked using Picard 2.20.2 MarkDuplicates (https:// broad insti tute. github. io/ picard/), and the marked BAM files were used as input for variant calling. The alignment rate, coverage, and mean depth were calculated using Sambamba 21 . www.nature.com/scientificreports/ lated using t-test statistics (< 0.001) and fraction of reads with zero mapping quality (MQ0 < 0.5). The CNVs in unplaced scaffolds were removed. Structural variations including CNV were detected by 'lumpyexpress' command of LUMPY with default parameter 23 . Overlapped copy number variable regions with same type of CNV between results of CNVnator and LUMPY were defined as concordant CNVs in every individual. The chromosomal distribution of the concordant CNVs were compared between male and female, p-arm and q-arm, and among populations. A 50% reciprocal overlap between filtered CNVs was defined as copy number variation region (CNVR) using CNVRuler 24 . CNVRs found in two and more of individuals were used for downstream analysis to minimize false-positive. Copy number of every gene on CNVR were calculated based on aligned read depth and normalized using CNVnator 22 . The normalized copy number of neutral region from diploid autosome was assumed to be 2.0.

CNV
Hierarchical clustering based on CNVR. To cluster individuals according to their CNV similarities, we made a vector representing presence or absence of CNV for each individual of genes on CNVRs. Hierarchical clustering with 1000 bootstrap resampling was performed on these vectors for genes on autosomal CNVR using pvclust with the default option in R 25 . The 'correlation' and 'average' were used as distance measures and the agglomerative method, respectively. The approximately unbiased (AU) p-value was calculated by multiscale bootstrap resampling. The bootstrap probability (BP) p-value was calculated by ordinary bootstrap resampling based on the unweighted pair-group average method (UPGMA).

Copy number variable genes between populations. The normalized copy number of genes on
CNVRs of all individuals was calculated using CNVnator 22 . The normalized copy number of the neutral region from diploid autosome was assumed to be 2.0. V ST of normalized copy number between a pair of populations was calculated as In addition, the V ST of autosomal copy number variable genes were calculated between AD, AW, ED, EW, and NEW. These results were visualized as Manhattan plots using qqman package in R 28 .
One-way ANOVA test on copy number of every genes on autosomal CNVRs were performed on 5 groups including AD, AW, ED, EW, and NEW. As a post hoc test of ANOVA, Scheffe test was performed on genes of which ANOVA resulting p-values was smaller than 0.05. Genes on CNVR which satisfy both upper 1% pairwise V ST and the p-value less than 0.05 of Scheffe test after one-way ANOVA were defined as population differentiated genes. Hypothetical, putative, predicted, or uncharacterized genes, as well as pseudo-genes, were excluded.
Ethics approval and consent to participate. The 15 Blood samples (5 Korean Native Pigs, 5 Woori-Heukdon, and 5 Duroc individuals) were collected during routine veterinary treatments with the logistical support under the approval of NIAS20212224, Republic of Korea. No further ethics permissions were required for this study. All animals were handled in strict accordance with good animal practice.

Results
Sequence alignment, CNV calling and CNVR definition. The coverage and sequencing depth are important for the credibility of CNVs called using the read depth information of short read alignment. Sequence alignment statistics including mapping rate, coverage and mean depth of all samples were summarized in Table S1. In our dataset, the minimum mean depth was higher than 5.06×, and the mean values of alignment rate, coverage, and mean depth of coverage were about 99.3%, 96.4%, and 18.5×, respectively (Table S1). Number of CNVs defined by CNVnator, Lumpy and consensus CNV of the two software was summarized in Table S2. Lumpy called more CNVs especially deletion than CNVnator in most of individuals. After calling and filtering CNVs, genome-wide CNVRs were identified. Chromosome-wise distribution of CNVs and their total length was summarized in Table S3. Among chromosomes, the ratio of total length of CNV to chromosome size were the largest in chromosome Y, followed by chromosome 12 and 6 while the smallest in chromosome 18 followed by 16 and 15. Total length of CNVs were larger in female than male in chromosome 12 and 2, while smaller in chromosome 11, and 16 (Table S4). CNV distribution on p-arm and q-arm were also compared based on centromeric region defined in the reference genome. Most of centromere-defined chromosome had more CNVs on q arm while less CNVs on q arm in chromosome 3, 5 and 9 (Table S5). Distribution of CNVR larger than 100 kb and 500 kb were visualized separately in Fig. 1. Average size of autosomal CNV of AD, AW, ED, EW and NEW were about 51.7, 51.9, 37.9, 26.6 and 9.0 Mbp. Average lengthening and shortening of chromosomal length in each group was summarized in Table S6. There were population specific lengthening and shortening of chromosomal length in chromosome 4-6, 8 and 14-18 (Table S6). All genes on CNVRs with statistics were summarized in Table S7.
Population differentiation based on copy number variable genes. Hierarchical clustering of all individuals was performed on vectors considering the presence or absence of autosomal CNVRs (Fig. S1). Mean pairwise V ST values of breeds including more than one animal were calculated on 315 animals from 43 populations and visualized as a heatmap with hierarchical clustering (Fig. 2). The V ST range is from 0 to 1, with a higher value indicating a larger difference. The pairwise mean V ST values of five groups were as following: AD-ED, 0.009; AD-AW, 0.032; AD-EW, 0.015; AD-NEW, 0.005; ED-EW, 0.012; ED-AW, 0.040; ED-NEW, 0.005; AW-EW,  (Fig. 3). There were some peaks shared by pairs of groups. We suggested the shared peaks between pairs including a same group as the regions with copy numbers distinct from other groups. Then, differences of normalized copy numbers across the five groups were tested using the one-way ANOVA followed by Scheffe test. Among genes of which the p-value was below 0.05, 111 genes of which V ST values in the upper 0.1% of at least a pair of groups defined as copy number variable genes (Table S7). 15 genes were remained after excluding hypothetical, putative, predicted, or uncharacterized genes, as well as pseudo-genes (Table 1). Among these copy number differentiated genes, group-wise average copy number of every 1 kb of EEA1 were visualized in Fig. 4.

Discussion
Since the colonization of wild boar across mainland Eurasia and North Africa within two Mya and domestication started 10,000 years ago, Sus scrofa has been adapted to various environments and human needs. In addition to selection pressure, demographic events such as the bottleneck in the last glacial period about 20,000 years ago and migration following farmers intensified the development of various pig breeds. Furthermore, modern breeding programs have accelerated genomic studies on pigs with the aim of improving their value as a source of meat and model animals. In particular, porcine CNV has been a great subject for studying phenotypic variance, especially in quantitative traits, as it can alter gene dose and expression. Our study analyzed the largest number of Eurasian wild boar and domesticated pigs with two values to measure the differences in copy number between populations. The first was V ST based on variance, and the second was the one-way ANOVA test. Considering both values together, we present the copy number variable regions and compare the copy number between populations. Chromosome-wise distribution of CNVs were compared by population, sex and chromosomal location such as p-arm and q arm separately. The autosomal CNVs covered larger regions in Asian pigs than European pigs which might be results of reference bias of using single reference representing Duroc. On the other hand, we could not observe any consistent effects of sex and chromosomal location on prevalence of CNVs. There would be other multiple genomic features which affect the probability of CNV occurrence. Hierarchical clustering was performed on vectors representing the presence or absence of CNVs on autosomal CNVRs. Some individuals were clustered following their groups while others were not. For example, Pietrain individuals were clustered discordant with their breeds. Actually, variance of copy numbers was highest in Pietrain among breeds with the value (1.06) significantly higher than others, followed by the variance of Meishan (0.33). Thus, both the clustering result and the high variance of copy numbers indicate that the within-variance of Pietrain is higher than other breeds.  www.nature.com/scientificreports/ Whether domesticated or in the wild, most individuals were clustered along their region rather than their way of life. It implies that gene flow between domesticated pigs and wild boar is still occurring in some areas. Even with the separation between domesticated and wild, the impact of artificial selection on porcine CNV may not be large enough to surpass the impact of gene flow between domesticated and wild.
All the Woori-Heukdon (KWH) and Korean native pigs (KNP) were clustered together with Duroc. KWH was developed by crossbreeding of three generations starting from pure Duroc sow and KNP, also called Chookjin-Chamdon. The F1 hybrid sow was crossed with pure Duroc boar, and the F2 hybrid sow was crossed with Duroc boar. Because the breed development was a recent event finished in 2011, the inherited CNV of KWB has been changed a little.
Since the pairwise V ST becomes smaller when V S becomes larger, almost all pairs of breeds with Pietrain of which the variances in copy number was the largest one, had the smallest V ST . In contrast, all pairs with Enshi black pig had the highest V ST . Due to the fact that the distance between breeds in clustering on the heatmap was only measured with the mean value of pairwise V ST , the clustering of breeds was not always concordant with their groups.
Copy number alteration of genes can make drastic change in phenotype by affecting on the expression and the structure of protein. Therefore, the copy number differentiated genes would be suggested as candidate regions of selection. We suggested how CNVs involved in the evolution of each population by considering environmental differences between respective population and functions of copy number differentiated genes.
Polycystic Kidney and Hepatic Disease 1-Like 1 (PKHD1L1) encodes a member of the polycystin protein family containing 11 transmembrane domains. PKHD1L1 has been reported as a candidate gene for variation in pH of pork 29 , which is related to meat color and water holding capacity. The average copy numbers of PKHD1L1 were slightly lost in groups except for NEW, and they were slightly higher in the European than Asian population. This CNV would be a causative variation on the difference in meat color and water holding capacity between populations.
CLEC4E encodes C-type lectin domain family 4 member E protein. The protein, also called Mincle (Macrophage inducible C-type lectin), is an innate immune receptor on myeloid cells sensing pathogens 30 . Since it was first described as a receptor for mycobacterial cell wall glycolipid and cord factor, the role of Mincle in innate immunity against mycobacterial infection has been investigated. Upregulation of Mincle expression in response to mycobacterial infection were observed in mice 31    www.nature.com/scientificreports/ pro-inflammatory responses. In addition to this fundamental role as a receptor, Mincle can act as an immune modulator in different models by either promoting anti-inflammatory cytokines expression or downregulating pro-inflammatory signaling pathways 30,32 . Tuberculosis, mainly caused by mycobacterial infection, is a severe threat to pigs. Wild boar was suggested as a reservoir that maintains and spreads tuberculosis infection 33 . The copy numbers of CLEC4E were lost in domestic groups and NEW while neutral in EW and AW. The higher copy number of the CLEC4E in wild boars may be presented as evidence of adaptation to mycobacterial infection prevalent in the wild environment. The average copy number of early endosome antigen 1 encoding gene, EEA1 in every groups was more than 8.8 (Table 1). These abnormal copy numbers are most likely caused by minor variations in the reference genome. We demonstrated average copy numbers of five groups in genomic regions, including upstream, protein coding, and downstream region of EEA1 in Fig. 3. The average copy numbers in all groups peaked in two regions: 90,227,001-90,240,000 and 90,244,001-90,250,000. Furthermore, the homologous shape of the graphs among all groups also supported the possibility of minor deletion in the reference genome. EEA1 consists of 5′ upstream, 31 exons, 30 introns, and 3′ downstream sequences, and the peak regions covered exons 16-21, 23, 24 and their adjacent introns. The previous gene reconstruction using additional alternate transcripts of pig individuals also improved a model of EEA1 whose model was missed in Ensembl 34 .
The GVIN1, interferon-induced very large GTPase 1, was upregulated in PRRSV-infected porcine alveolar macrophage 35 while downregulated in lungs during bacterial respiratory infection 36 . However, the biological mechanism of GVIN1 expression against infection and the phenotypical effect of deletion in the porcine genome remain poorly understood.
Kojima and Degawa 37 demonstrated that UGT2B31 expression was higher in male pigs when compared to female pigs and that testosterone treatment of castrated boars increased UGT2B31 expression. Considering the above literature and gene expression network, Sahadevan et al. 38 suggested that UGT2B31 could play steroid metabolic roles in porcine androgen/androstenone metabolism. Sabmborski et al. 39 also demonstrated a significant decrease in UGT2B31 expression on day 14 of the pregnant pig. These previous studies continuously identified the role of UGT2B31 in steroid hormone biosynthesis. The copy number of UGT2B31 in EW and NEW groups were significantly gained. Moreover, SC5D is another gene involved in steroid biosynthesis, such that the expression of SC5D was upregulated in the pig ovary during the luteal phase 40 . The copy number of SC5D was significantly different in our rank-and variance-based test, and the average copy numbers were slightly higher in European pigs than in others. Therefore, these steroid syntheses related genes could be suggested as candidates which can make a difference in reproductive traits between porcine populations.
Cytochrome P450 (CYP) is a type of oxygenase. A previous study identified differences in the fatty acid composition of adipose tissues between Korean native and Yorkshire pigs 41 . The significantly higher expression of CYP genes in Yorkshire was presented as the cause of lower arachidonic acid and higher cis-11,14,17-Eicosatrienoic acid, which are responsible for meat flavor. One of CYP isoforms CYP2C36 was also suggested as copy number variable genes in our result. The mRNA levels of CYP2C33, CYP2C49, CYP3A29, and CYP3A46 were reported as significantly different between Meishan and Landrace in 5-months pigs according to their sex 42 . In addition to the different androgen levels, CNV could be suggested as another cause of differential expression of several CYPs. Because CYPs are also important in the drug metabolism of pigs, CNV of CYP should be considered when studying pigs as a model animal for drug metabolism.
There were NEW-specifically duplicated genes such as EFHC1, ZWINT, and ROPN1L, but little was revealed about their function in pig. Moreover, the number of NEW individuals here were only two, which was too few to suppose these genes play important role in evolution of NEW. In addition, Previous studies were not enough to investigate the functional impact of copy number variation of like-genes such as MARCKSL1, HSBP1L1, and NIF3L1 in the pig. Furthermore, the copy number of MARCKS and HSBP1 were not significantly variable in both V ST and the Kruskal-Wallis test. MYO1H had not been reported yet about their phenotype and genomic variation in Sus scrofa.

Conclusions
In this study, we explored copy number variable genes of pig populations and estimated differentiation in copy number of genes on CNVRs between populations. Also, the CNV of Woori-Heukdon was firstly investigated, and it presented similarity of CNV among recently developed breeds and their paternal/maternal populations. Although this is one of the largest porcine CNV studies, the case of minor variants suggested the limitation of CNV calling using NGS read alignment with single reference genome. In further studies, we anticipate that additional high-quality genome assemblies representing various populations, and experimental validation would improve evolutionary insights on the CNV of pigs.

Data availability
The newly generated sequences for 5 Korean Native Pigs, 5 Woori-Heukdon, and 5 Duroc individuals are available from Sequence read archive (SRA) with the Bioproject Accession Number PRJNA843521 (https:// www. ncbi. nlm. nih. gov/ biopr oject/ PRJNA 843521). All other whole genome sequence data are available in NCBI SRA database (Table S1).